Benjamin Reisman
MD/PhD Candidate
Vanderbilt University
March 22, 2019
How to following along:
“R is a free software environment for statistical computing and graphics”
Compared to Commercial Flow Cytometry Software, R has the following advantages:Example: The iris dataset: measurements of 50 flowers of 3 species of iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
Data Frame: An \(n*m\) array of items, but each column can be a different class
## [1] "data.frame"
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] 5.1 3.5 1.4 0.2
## [2,] 4.9 3.0 1.4 0.2
## [3,] 4.7 3.2 1.3 0.2
## [4,] 4.6 3.1 1.5 0.2
## [5,] 5.0 3.6 1.4 0.2
## [6,] 5.4 3.9 1.7 0.4
## num [1:150, 1:4] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
To work with data in R, it’s best to have ‘tidy data,’ which meets the following criteria:
…but cytometry data is not usually tidy.
For more information, see: Wickham, Hadley. “Tidy data.” Journal of Statistical Software 59.10 (2014): 1-23.
A number of specialized classes have been developed to represent high dimensional bioinformatics data:
SummarizedExperiment - created to represent genetic data (RNAseq, microarray, etc…)flowcore (RGlab)
FlowFrame - Representation of an FCS file in RFlowSet - Container for multiple FlowFrames + MetadataflowWorkspace (RGlab)
GatingSet- A FlowSet + associated gating hierarchyA cytometry experiment may include:
… but those aren’t neatly represented in R:
| Traditional Object | FlowCore Object | R Equivalent |
|---|---|---|
| FCS File | FlowFrame | Matrix |
| Bunch of FCS File | FlowSet | List of matrices + pData |
| Gated Experiment | Gatingset | - |
None of these are a data.frame, the most flexible datatype in R
fetchCytobankExperimentparseCytobankExperimentcytoML::cytobankExperimentas.data.frame.flowFrameFirst we’ll need to load a few packages…
library(CytobankAPI) #connects to cytobank
library(flowWorkspace)#loads flowcore, flowWorkspace
library(CytoML) #Used to read in gating files
library(cytotidyr) #for importing cytobank experiments, and tidying
library(dplyr) #for manipulating data
library(tidyr) #for rearranging data from wide to long
library(ggplot2)and find our files…
## [1] "KCL Guys Data/20180321-01 Group 1 Helios B Post-viSNE_Ungated.fcs.density.fcs.cluster.fcs"
## [2] "KCL Guys Data/20180321-02 Group 2 Helios A Post-viSNE_Ungated.fcs.density.fcs.cluster.fcs"
## [3] "KCL Guys Data/20180321-03 Group 3 Helios A Post-viSNE_Ungated.fcs.density.fcs.cluster.fcs"
## [4] "KCL Guys Data/20180321-04 Group 4 Helios B Post-viSNE_Ungated.fcs.density.fcs.cluster.fcs"
Using CytobankAPI and Cytotidyr we’ll read in our experiment information from cytobank. This includes:
token <- "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqdGkiOiJhNzE5YzU0MTU0OGM0ZDEzMzI3NjE4MGQzYmM0ZGJmMyIsImV4cCI6MTU1MzIxNDM2NSwidXNlcl9pZCI6MTQ3LCJhdWQiOiJjeXRvYmFua19hcGlfdjFfdXNlcnMiLCJpYXQiOjE1NTMxODU1NjUsImlzcyI6Imh0dHBzOi8vdmFuZGVyYmlsdC5jeXRvYmFuay5vcmcvIiwibmJmIjoxNTUzMTg1NTY1LCJzdWIiOiJjeXRvYmFua19hcGlfdjEifQ.uXJfnd0e7FbJNBSI_YPxOsY5GsRNkMfhmhsuE92pliQ"
cyto_session <- authenticate("vanderbilt", auth_token = token)
experiment.id <- 29958
exp_info <- fetchCytobankExperiment(cyto_session, experiment.id)First we’ll read in the data as a flowSet
Then we’ll convert it to a gatingSet
## ....done!
Next we’ll:
mygatingset <- flowWorkspace::transform(mygatingset, exp_info$transforms)
markernames(mygatingset) <- exp_info$panels$`Panel 1`
CytoML::gating(exp_info$gates, mygatingset)## non-beads
## eventlength_width
## center_width
## residual_width
## DNA
## viable
## B Cells
## T Cells
## CD4s
## CD8s
## ....done!
In order to work with our data using R, we’ll need to convert it to a data frame, using the as.data.frame function from cytotidyr
## 'data.frame': 57684 obs. of 79 variables:
## $ Time : num 86.5 96.9 143.5 444.2 544 ...
## $ Event_length : num 15 15 15 22 17 19 17 15 19 16 ...
## $ 89Y_CD45 (v) : num 3.96 4.38 4.39 3.74 4.21 ...
## $ 102Pd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 103Rh_Viability (v): num 0 0 0 0 0.464 ...
## $ 104Pd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 105Pd : num 0 0 0 0 0 ...
## $ 106Pd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 108Pd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 110Pd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 113In : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 114Cd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 115In : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 120Sn : num 0 0 0.014 0 0 ...
## $ 127I : num 0 0 0 0 0 ...
## $ 131Xe : num 0 0 0 0 0 ...
## $ 133Cs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 138Ba : num 1.2 1.72 1.66 1.84 1.85 ...
## $ 139La : num 0 0 0 0 0 ...
## $ 140Ce_EQ_Beads (v) : num 0.238 0 0 0 0 ...
## $ 141Pr_CCR6 (v) : num 0 0 0 0 0 ...
## $ 142Nd : num 0.268 0 0.376 0 0 ...
## $ 143Nd : num 0.766 0 0 0 0 ...
## $ 144Nd : num 0.4031 0.3158 0.0934 0 0.1101 ...
## $ 145Nd_CD4 (v) : num 0.65661 0.00594 0.87242 3.19221 3.99948 ...
## $ 146Nd_CD8 (v) : num 5.77 2.32 5.62 1.26 0 ...
## $ 147Sm_CD20 (v) : num 0 0 0.0314 0 0 ...
## $ 148Nd_CD16 (v) : num 0.1981 3.3858 0.4623 0.0712 0 ...
## $ 149Sm_CCR4 (v) : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 150Nd : num 0 0 0 0 0 ...
## $ 151Eu : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 152Sm : num 0 0 0.0263 0.1204 0.2434 ...
## $ 153Eu : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 154Sm_CD3 (v) : num 4.84 0 5.26 4.47 5.04 ...
## $ 155Gd_CD45RA (v) : num 3.299 3.014 4.374 0 0.841 ...
## $ 156Gd : num 0 0 0 0 0.115 ...
## $ 157Gd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 158Gd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 159Tb_CCR7 (v) : num 2.34 0 3.22 2.28 3.32 ...
## $ 160Gd_CD14 (v) : num 0 0 0.0252 0 0 ...
## $ 161Dy : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 162Dy : num 0.844 0 0.889 0 0 ...
## $ 163Dy_CXCR3 (v) : num 0.0603 0 3.2913 0 0.9788 ...
## $ 164Dy : num 0 0.415 0 0 0 ...
## $ 165Ho_CD45RO (v) : num 0 0 0 0 0 ...
## $ 166Er : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 167Er_CD27 (v) : num 3.09 0 3.72 2.56 3.22 ...
## $ 168Er : num 0.0415 0 0.2607 0 0.373 ...
## $ 169Tm_CD25 (v) : num 0 0 0 1.845 0.677 ...
## $ 170Er : num 0 0 0 0.0508 0 ...
## $ 171Yb : num 0 0 0 0 0 ...
## $ 172Yb_CD57 (v) : num 0 0 0 0 0 ...
## $ 173Yb : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 174Yb_HLA-DR (v) : num 0 0.314 0.51 0 0 ...
## $ 175Lu : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 176Yb_CD127 (v) : num 1.81 0 2.26 0.7 1.96 ...
## $ 177Hf : num 0 0 0 0 0 0 0 0 0 0 ...
## $ 190BCKG : num 0 0 0.152 0 0 ...
## $ 191Ir_DNA (v) : num 6.06 6.31 6.33 6.21 6.15 ...
## $ 193Ir : num 6.67 6.89 6.93 6.73 6.74 ...
## $ 194Pt : num 1.418 1.301 0.744 0.742 0.499 ...
## $ 195Pt : num 0 0 0 0 0.419 ...
## $ 196Pt : num 0 0.53 0 0 0 ...
## $ 198Pt : num 0 0.143 0 0 0 ...
## $ 207Pb : num 0 0 0 1.14 0 ...
## $ 208Pb : num 0.212 0 0 0.591 0 ...
## $ 209Bi : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Center : num 5.37 5.48 5.32 5.84 5.46 ...
## $ Offset : num 3.53 3.74 3.54 3.73 3.61 ...
## $ Width : num 2.71 2.93 2.73 3.29 2.93 ...
## $ Residual : num 3.23 3.17 3.36 3.32 3.4 ...
## $ tSNE1 : num 24.4 15.83 19.45 -7.96 -13.08 ...
## $ tSNE2 : num -7.749 16.926 0.364 -5.673 -3.741 ...
## $ density : num 7.63 6.83 6.57 8.48 8.47 ...
## $ cluster : num 45 57 53 52 67 5 128 185 45 5 ...
## $ FCS Filename : chr "20180321-04 Group 4 Helios B Post-viSNE_Ungated.fcs.density.fcs.cluster.fcs" "20180321-04 Group 4 Helios B Post-viSNE_Ungated.fcs.density.fcs.cluster.fcs" "20180321-04 Group 4 Helios B Post-viSNE_Ungated.fcs.density.fcs.cluster.fcs" "20180321-04 Group 4 Helios B Post-viSNE_Ungated.fcs.density.fcs.cluster.fcs" ...
## $ Individuals : Factor w/ 4 levels "Group 1","Group 2",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Plate : Factor w/ 1 level "Plate 1": 1 1 1 1 1 1 1 1 1 1 ...
## $ FCS.File.Category : Factor w/ 1 level "Experiment Files": 1 1 1 1 1 1 1 1 1 1 ...
One thing we may want to do is reproduce the same t-SNE figure we made on cytobank:
ggplot(mydataframe, aes(x = tSNE1, y = tSNE2)) +
geom_point(shape = ".") +
coord_fixed() +
facet_wrap(~Individuals) We can also customize our plots in ways that are not easy to do in cytobank:
ggplot(mydataframe, aes(x = tSNE1, y = tSNE2)) +
geom_point(shape = 16, alpha = 0.2, size = 0.2) +
coord_fixed() +
facet_wrap(~Individuals) +
theme_minimal() +
theme(axis.text = element_blank(),
panel.background = element_rect(color = "black", fill = NA))We may also want to plot multiple channels in the same plot with faceting
## [1] 57684 79
element2 <- function(x){unlist(lapply(strsplit(x, split = "_|\ "),"[[", 2))}
mydataframe.long <- mydataframe %>%
as_tibble() %>%
gather(marker, intensity, contains("(V)")) %>% # <- this is the key step
separate(marker, c("channel", "marker", "drop"), sep= "_|\ ") %>%
as_tibble()
dim((mydataframe.long))## [1] 1211364 62
Then we’ll make our plot:
mydataframe.long %>%
ggplot(aes(x = tSNE1, y = tSNE2, col = intensity)) +
geom_point(shape = ".") +
scale_colour_viridis_c(option = "A") +
coord_fixed() +
facet_wrap(~marker, nrow = 3) +
theme_minimal() +
theme(axis.text = element_blank(),
panel.background = element_rect(color = "black", fill = NA))One of the advantages of R is that we’re not limited to the dimensionality reduction techniques that are included in commercial packages.
First we’ll need to create a separate matrix containing the columns we want to be included in the dimensionality reduction.
Then we’ll run it through the uwot implementation of UMAP
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
## The following object is masked from 'package:flowCore':
##
## %&%
## num [1:57684, 1:2] 1.673 10.16 0.423 -4.02 -3.494 ...
## - attr(*, "scaled:center")= num [1:2] -0.0208 -0.0843
Next, we’ll rejoin the two new UMAP columns to our original dataframe, and make our plot:
mydataframe %>%
bind_cols(as.data.frame(myumap)) %>%
as_tibble() %>%
gather(marker, intensity, contains("(V)")) %>% # <- this is the key step
separate(marker, c("channel", "marker", "drop"), sep= "_|\ ") %>%
ggplot(aes(x = UMAP1, y = UMAP2, col = intensity)) +
geom_point(shape = ".") +
scale_colour_viridis_c(option = "A") +
coord_fixed() +
facet_wrap(~marker, nrow = 3) +
theme_minimal() +
theme(axis.text = element_blank(),
panel.background = element_rect(color = "black", fill = NA))We can also plot our data as a map:
library(scico)
axis.max <- apply(myumap, 2, max) + 1
axis.min <- apply(myumap, 2, min) - 1
mydataframe %>%
bind_cols(as.data.frame(myumap)) %>%
ggplot(aes(x=UMAP1, y = UMAP2)) +
stat_density_2d(h = c(1, 1),
n = 1024,
geom = "raster",
contour = F,
aes(fill = stat(density))) +
scale_fill_scico(palette = "oleron", name = "density", trans = "sqrt") +
scale_x_continuous(expand = c(0,0), limits = c(axis.min[1], axis.max[1])) +
scale_y_continuous(expand = c(0,0), limits = c(axis.min[2], axis.max[2])) +
theme_minimal() +
coord_fixed()We can also apply a clustering algorithm to our dimensionality reduced data.
#install.packages("dbscan")
library(dbscan)
library(scales)
mydbscan <- dbscan(myumap, eps = 0.35, minPts = 150)
mydbscan## DBSCAN clustering for 57684 objects.
## Parameters: eps = 0.35, minPts = 150
## The clustering contains 13 cluster(s) and 619 noise points.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 619 4268 5110 9011 24409 4317 1768 3996 1498 929 808 544
## 12 13
## 220 187
##
## Available fields: cluster, eps, minPts
#this finds the number of clusters and manually defines the palette
#such that the outlier cluster is "grey50"
nclust <- max(unique(mydbscan$cluster))
mypalette <- c("grey50", hue_pal()(nclust))
mydataframe %>%
bind_cols(as.data.frame(myumap)) %>%
mutate(cluster = as.factor(mydbscan$cluster)) %>%
ggplot(aes(x=UMAP1, y = UMAP2, col = cluster)) +
geom_point(shape = ".") +
scale_colour_manual(guide = guide_legend(override.aes = list(shape = 15)),
values = mypalette) +
coord_fixed() +
theme_minimal() +
theme(panel.border = element_rect(color = 'black', fill = NA))We have clusters, but how can we understand what makes them distinct?
##
## Attaching package: 'tibble'
## The following object is masked from 'package:flowCore':
##
## view
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
myheatmap <- mydataframe %>%
bind_cols(as.data.frame(myumap)) %>%
mutate(cluster = as.factor(mydbscan$cluster)) %>%
gather(marker, intensity, contains("(V)")) %>% # <- this is the key step
group_by(cluster, marker) %>%
summarise(MFI = median(intensity)) %>%
select(marker, MFI, cluster) %>%
spread(marker, MFI)
myheatmap.mat <- myheatmap %>%
ungroup() %>%
column_to_rownames("cluster") %>%
as.matrix()
matrix.dist.row <- dist((myheatmap.mat))
matrix.dist.col <- dist(t(myheatmap.mat))
row.order <- seriation::get_order(seriate(matrix.dist.row, method = "HC"))
col.order <- seriation::get_order(seriate(matrix.dist.col, method = "HC"))
myheatmap %>%
ungroup() %>%
gather(marker, MFI, -cluster) %>%
mutate(marker = factor(marker, levels = colnames(myheatmap.mat)[(col.order)])) %>%
mutate(cluster = factor(cluster, levels = rownames(myheatmap.mat)[(row.order)])) %>%
ggplot(aes(x=cluster, y = marker,fill = MFI)) +
geom_tile(colour = "grey90", size = 0.5) +
scale_fill_viridis_c(option = "E") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
coord_fixed()Here’s an example of a MEM Heatmap of the same data. For more information about MEM see:
library(MEM)
library(tibble)
library(stringr)
mymem <- MEM(cbind(mymatrix, cluster = mydbscan$cluster))
matrix.dist.row <- dist((mymem$MEM_matrix[[1]]))
matrix.dist.col <- dist(t(mymem$MEM_matrix[[1]]))
row.order <- seriation::get_order(seriate(matrix.dist.row, method = "HC"))
col.order <- seriation::get_order(seriate(matrix.dist.col, method = "HC"))
mymem$MEM_matrix[[1]] %>%
as.data.frame() %>%
rownames_to_column("cluster") %>%
as.tibble() %>%
gather(marker, mem, -cluster) %>%
# mutate(marker = factor(marker, levels = colnames(mymem$MEM_matrix[[1]])[(col.order)])) %>%
# mutate(cluster = factor(cluster, levels = rownames(mymem$MEM_matrix[[1]])[(row.order)])) %>%
mutate(marker = substr(str_extract(marker, "\\_\\w*"), 2, 100)) %>%
ggplot(aes(x=cluster, y = marker, fill = mem)) +
geom_tile(colour = "grey90", size = 0.5) +
scale_fill_scico(palette = 'lisbo', limits = c(-10,10),
name = "MEM Score") +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0)) +
coord_fixed()FIt-SNE
Here’s an example of FIt-SNE Running. To install it and run it yourself, follow the instructions on the FIt-SNE github page
orig.wd <- getwd()
fitsne_scirpt <- "C:/Users/benja/Downloads/FIt-SNE-master/FIt-SNE-master/fast_tsne.R"
setwd(dirname(fitsne_scirpt))
#?basename
source(fitsne_scirpt)## FIt-SNE R wrapper loading.
## FIt-SNE root directory was set to C:/Users/benja/Downloads/FIt-SNE-master/FIt-SNE-master
## [1] 1
## Time difference of 2.014717 mins
colnames(myfitsne) <- c("FItSNE1", "FItSNE2")
as_tibble(myfitsne) %>%
bind_cols(as_tibble(mydataframe)) %>%
ggplot(aes(x=FItSNE1, y = FItSNE2, col = `154Sm_CD3 (v)`)) +
geom_point(shape = ".", size = 2) +
scale_color_viridis_c(option = "A", name = "CD3") +
coord_fixed() +
facet_wrap(~Individuals) +
theme_minimal() +
theme(axis.text = element_blank(),
panel.background = element_rect(color = "black", fill = NA))